Speed up LineReader::next_line #89

Open · wants to merge 1 commit into master

Conversation

chris0x44

Measuring with a large, numbers-only CSV, this change showed a ~25% improvement in reading speed.
@ben-strasser
Owner

I generated a test file as follows:

{
        std::ofstream out("testfile");
        for(int i=0; i<100000000; ++i){
                out << i;
                for(int j=0; j<10; ++j)
                        out << ',' << j;
                out << '\n';
        }
}

I use

#include "csv.h"
#include <sys/time.h>
#include <iostream>

long long get_micro_time(){
        timeval t;
        gettimeofday(&t, 0);
        return t.tv_sec*1000000ll + t.tv_usec;
}

int main(){
        for(int i=0; i<16; ++i){
                long long timer = -get_micro_time();
                int n = 0;
                io::LineReader reader("testfile");
                while(char* line = reader.next_line()){
                        (void)line;
                        ++n;
                }
                timer += get_micro_time();
                std::cout << timer << std::endl;
        }
}

to benchmark. I compile with g++ -O3 main.cpp -std=c++0x -o prog -pthread.
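As an aside, gettimeofday is POSIX-only, so the timer helper above won't build as-is under Visual Studio. A portable equivalent using std::chrono (my substitution, not the code used for the numbers below) would be:

```cpp
#include <chrono>

// Portable replacement for the POSIX gettimeofday()-based timer.
// steady_clock is monotonic, which is what you want for benchmarking.
long long get_micro_time(){
        using namespace std::chrono;
        return duration_cast<microseconds>(
                steady_clock::now().time_since_epoch()).count();
}
```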

The current version has the following running times:

2227115
2241192
2241685
2271418
2254201
2264049
2268043
2288949
2292267
2324893
2344262
2231287
2182160
2156060
2153017
2145480

The proposed version has the following running times

2036938
2020005
2028032
2025394
2032634
2123398
2062249
2066723
2037287
2048502
2050186
2131472
2047372
2023591
2014507
2021027

The proposed version seems to be about 10% faster. I'll have a look at the code tomorrow for subtle bugs. If I find none, I'll integrate the PR.

(Some changes are necessary to make it work with GCC 4.8.4. I will therefore not directly merge it as is.)

@chris0x44
Author

chris0x44 commented Oct 4, 2019

Interesting result. I was using similar code (with the line count reduced to 50000000) with Visual Studio 2019.

Old implementation

32-bit    64-bit
1403602   1399353
1396644   1385363
1412428   1385887
1385986   1428209
1302536   1476942
1324161   1387329
1330136   1375160
1304215   1376447
1286355   1403022
1278173   1408659

New implementation

32-bit    64-bit
1066176   1009143
1068074   1019310
1036172   1015235
1055584   1008792
1058937   1015928
1053741   1070592
1063439   1145061
1052232   1004978
1055186   1003172
1058393   1001748

Averaged out, that's a ~21% (32-bit) and ~26% (64-bit) improvement. It seems VS was generating less efficient code than GCC.

@chris0x44
Author

Just realized that you were compiling with -std=c++0x. Would you rather not rely on full C++11 support, to maintain backwards compatibility? Just asking with future changes in mind.

@ben-strasser
Owner

My policy with respect to old compilers is that if newer compilers provide a significant advantage, bumping the required compiler version can be considered. However, so far neither C++14 nor C++17 has provided anything where I'd say it is worthwhile.

I did some more benchmarking today. First off, all the running times in the low-seconds range are measured with a preheated file system cache; the data is not served from disk. When the data actually comes from the hard disk, we are talking double-digit seconds, and all the optimizations done here are negligible.

That being said, optimizing the case where data comes from the file system cache has its uses.

The results of my micro benchmarks are really strange. I'm using the same test setup as before. I get the following running times for the current version:

2118994
2122616
2125639
2123606
2118886
2144089
2126694
2127466
2133072
2135363
2138806
2139697
2133833
2134297
2138764
2161570

This is consistent with yesterday. Now all I do is replace

                        int line_end = data_begin;
                        while(buffer[line_end] != '\n' && line_end != data_end){
                                ++line_end;
                        }

with

                        int line_end = data_begin;
                        while(line_end != data_end && buffer[line_end] != '\n'){
                                ++line_end;
                        }

The only difference is the order of the operands of the &&.

The times that I see are:

1910613
1932081
1925265
1924471
1939898
1938497
1932861
1925658
1927700
1928857
1924995
1933352
1937317
1926166
1933898
1933849

The order of those two operands seems to have a significant impact.

When I rerun the version from this PR, I get:

1921082
1918192
1922095
1918119
1917171
1957642
2027880
2028473
2017751
2017715
2021179
2021563
2016718
2021120
2018545
2022833

What I gather from these numbers is that the best option is to swap the order of the &&-operands and leave all the advanced bit-tricks out.
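For anyone benchmarking further: another variant worth measuring (a suggestion for comparison, not part of the PR) is to delegate the scan to std::memchr, which most libc implementations vectorize heavily:

```cpp
#include <cstring>

// Sketch: find the end of the current line with std::memchr instead of
// a hand-written loop. Returns the index of the '\n' within buffer, or
// data_end if the range [data_begin, data_end) contains no newline.
int find_line_end(const char* buffer, int data_begin, int data_end) {
        const void* hit = std::memchr(buffer + data_begin, '\n',
                                      static_cast<std::size_t>(data_end - data_begin));
        if (hit == nullptr)
                return data_end;
        return static_cast<int>(static_cast<const char*>(hit) - buffer);
}
```

Whether this beats the swapped-operand loop depends on line length: memchr has per-call overhead that only pays off once lines are long enough for its SIMD inner loop to kick in.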

@ben-strasser
Owner

The following version is also bad:

                        int line_end = data_begin;
                        const char*buffer_ptr = buffer.get();
                        while(line_end != data_end && buffer_ptr[line_end] != '\n'){
                                ++line_end;
                        }

The running times that I observe are:

2287366
2286987
2273850
2280842
2277892
2300054
2274304
2276366
2273849
2274207
2273673
2274042
2280661
2275466
2284489
2310198

This does not make sense to me.

@ben-strasser
Owner

Clang 9.0.0 produces running times of about 2.1 s regardless of the tested version. I think the current version on master wins out slightly. However, the difference is so small that it might be a random fluctuation.

@chris0x44
Author

Nice catch. I think I came across this, but thought it was a fluke (didn't look further into it). Now I compared the two versions (check end first vs. check LF first) and the result is very interesting:

size_t get_line_lf_end(char* buffer, int data_begin, int data_end) {
    int line_end = data_begin;
    while(buffer[line_end] != '\n' && line_end != data_end){
            ++line_end;
    }
    return line_end;
}

Compiles to this:

get_line_lf_end(char*, int, int):
        movsx   r8, esi
        cmp     BYTE PTR [rdi+r8], 10
        je      .L1
        cmp     esi, edx
        je      .L1
        add     esi, 1
        movsx   rax, esi
.L3:
        cmp     BYTE PTR [rdi+rax], 10
        mov     r8, rax
        sete    sil
        cmp     edx, eax
        sete    cl
        add     rax, 1
        or      sil, cl
        je      .L3
.L1:
        mov     rax, r8
        ret

While this:

size_t get_line_end_lf(char* buffer, int data_begin, int data_end) {
    int line_end = data_begin;
    while(line_end != data_end && buffer[line_end] != '\n'){
            ++line_end;
    }
    return line_end;
}

Compiles to this:

get_line_end_lf(char*, int, int):
        movsx   rax, esi
        cmp     esi, edx
        jne     .L10
        jmp     .L7
.L13:
        lea     ecx, [rax+1]
        add     rax, 1
        cmp     edx, eax
        je      .L12
.L10:
        cmp     BYTE PTR [rdi+rax], 10
        jne     .L13
.L7:
        ret
.L12:
        movsx   rax, ecx
        ret

You can see that the second implementation generates a much tighter loop, where je .L12 is practically never taken. This is the assembly generated by GCC 9.2 with -O3.
Here's the link to the Compiler Explorer sample: https://godbolt.org/z/KBANiA

I also did a test run with the switched comparison on Windows, and it showed a ~10-15% improvement over the original version, which is quite good.

@chris0x44
Author

One more thing regarding the bit-tricks commit. Can you do a test run with the following test file:

   {
      std::ofstream out("testfile");
      for (int i = 0; i < 10000000; ++i) {
         out << i;
         for (int j = 0; j < 10; ++j)
            out << ',' << (1000000+j);
         out << '\n';
      }
   }

It's a scenario where each field contains a 6-digit number instead of a single digit. The reason I ask is that I'm seeing the following results on my machine:
Baseline

949544
889969
888445
856085
944465
827497
876116
794432
731962
802601

Bit-trickery

443171
437787
447546
457375
434767
451046
432036
432197
431616
426336

That amounts to almost twice the parsing speed on Windows, and I'm very curious what GCC/Clang make of this.
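The PR's actual bit-tricks aren't quoted in this thread, but for readers following along, the classic SWAR technique for locating a byte (here '\n') a machine word at a time looks roughly like this (a generic sketch under the assumption of a little-endian host, not the PR's code):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// SWAR zero-byte trick: after XOR-ing with a word of '\n' bytes,
// bytes equal to '\n' become zero; (v - 0x01..) & ~v & 0x80.. then
// sets the high bit of exactly the zero bytes.
static inline std::uint64_t newline_mask(std::uint64_t v) {
        v ^= 0x0A0A0A0A0A0A0A0AULL;
        return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

// Scan [begin, end) for '\n', 8 bytes at a time, with a byte-wise
// tail loop. Returns the index of the first '\n', or end - begin.
std::size_t find_newline(const char* begin, const char* end) {
        const char* p = begin;
        while (end - p >= 8) {
                std::uint64_t word;
                std::memcpy(&word, p, 8);  // avoids unaligned-access UB
                std::uint64_t hits = newline_mask(word);
                if (hits) {
                        // Little-endian: the lowest flagged byte is the
                        // first match in memory order.
                        std::size_t byte = 0;
                        while (((hits >> (8 * byte)) & 0x80) == 0)
                                ++byte;
                        return static_cast<std::size_t>(p - begin) + byte;
                }
                p += 8;
        }
        while (p != end && *p != '\n')
                ++p;
        return static_cast<std::size_t>(p - begin);
}
```

This kind of word-at-a-time scan is exactly where long fields help: with 6-digit fields the loop spends more iterations in the 8-byte fast path and fewer in the scalar tail, which would be consistent with the near-2x numbers above.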
